arXiv: math. PR/0000000 



Stein's density approach for discrete 
distributions and information 
inequalities 

Christophe Ley 1 and Yvik Swan 2 

Université Libre de Bruxelles
Département de Mathématique
Boulevard du Triomphe 
Campus Plaine - CP 210 
B-1050 Brussels 
e-mail: chrisley@ulb.ac.be 

Université du Luxembourg
Faculté des Sciences, de la Technologie et de la Communication
Unité de Recherche en Mathématiques
6, rue Richard Coudenhove-Kalergi
L-1359 Luxembourg
e-mail: yvik.swan@uni.lu

Abstract: We inscribe Stein's density approach for discrete distributions
in a new, flexible framework, hereby extending and unifying a large portion 
of the relevant literature. We use this to derive a Stein identity whose power 
we illustrate by obtaining a wide variety of so-called inequalities between 
probability metrics and information functionals. Whenever competitor in- 
equalities are available in the literature, the constants in ours are better. 
We also argue that our inequalities are local versions of the famous Pinsker 
inequality. 

AMS 2000 subject classifications: Primary 60K35; secondary 94A17. 
Keywords and phrases: Discrete density approach, Poisson approxima- 
tion, scaled Fisher information, Stein characterizations, Total variation dis- 
tance. 



Foreword 

This paper is the follow-up in the discrete setting of our contribution [18]. Here 
we pursue our exploration of the way in which Stein's method blends into the 
framework of information theory, this time for discrete distributions. Similarly as 
in the absolutely continuous setting for Gaussian approximation via information- 
theoretic tools and Charles Stein's characterization of the Gaussian, the con- 
nection between information theory and Louis Chen's characterization of the 
Poisson distribution (see equation (1.3)) has also already been noted and used 
by a number of authors (including [3, 13, 21]). These results are nevertheless 
tailored for Poisson approximation only and are not easily extended to other 
choices of target distribution. To the best of our knowledge we are the first to
develop the adequate framework for exploring this powerful connection in all 
generality. 

The present work consists of two parts. First, in Section 1 we construct a very
flexible version of Stein's so-called "density approach" for discrete distributions 
which, together with its continuous counterpart from [18], allows us to extend 
and unify most of the literature on this topic (see, e.g., [6, 23]). Second, in 
Section 2 we use the construct from Section 1 to derive a natural Stein identity 
from which we derive the claimed connection with information inequalities. Both 
parts contain a self-sufficient and separate introduction as well as a literature 
review. 

1. The discrete density approach 

There exist several natural generalizations of Charles Stein's powerful charac- 
terization of the Gaussian distribution 



\[ X \sim \mathcal{N}(0,1) \iff \mathrm{E}\left[f'(X) - X f(X)\right] = 0 \text{ for all bounded } f \in C^1(\mathbb{R}). \tag{1.1} \]



For instance, [12] proposed a mechanism for generating so-called Stein char- 
acterizations for all distributions belonging to the exponential family, while 
[22] proposes a similar construction for all distributions belonging to Pearson's 
family. In 2004, Charles Stein and co-authors introduced what they coined the 
"density approach" in the seminal article [23]. Under several (arguably quite 
strong) conditions on the target density $p$ it is shown there that

\[ X \sim p \iff \mathrm{E}\left[f'(X) + \frac{p'(X)}{p(X)} f(X)\right] = 0 \tag{1.2} \]

for all $f$ belonging to a certain class of test functions. The operator in (1.2)
does not rely on an explicit expression for the target density and extends (1.1) 
to all densities whose score function p'(x)/p(x) is well-behaved. This extension 
is also discussed in [5, 6]. There are, however, a number of operators from the 
literature that cannot be written under the form (1.2); hence this form is not 
the most general. In the recent paper [18], we re-interpret (1.1) and (1.2) by 
focusing on the class of test functions for which such results are valid. This 
not only provides operators and characterizations for virtually any absolutely 
continuous distribution on the real line, but also encompasses most of the other 
first-order characterizations from the literature. 

Similarly as for absolutely continuous distributions, there also exist several
natural generalizations of Louis Chen's characterization of the Poisson($\lambda$)
distribution given by

\[ X \sim Po(\lambda) \iff \mathrm{E}\left[\lambda f(X+1) - X f(X)\right] = 0 \text{ for all bounded } f \in \mathbb{N}^{*}. \tag{1.3} \]



For instance, [22] proposes operators and characterizations for all distributions 
belonging to Ord's family, and [7] develops an extension of (1.3) to Gibbs mea- 
sures (and hence all discrete distributions on the real line). Very recently, [10]
proposed a true counterpart to (1.2) for discrete distributions (similar in spirit
to - though not inspired by - our anterior construction from an earlier draft of 
the present paper, see [19]) of the form 



\[ X \sim p \iff \mathrm{E}\left[\Delta^+ f(X-1) + \frac{\Delta^+ p(X)}{p(X)} f(X)\right] = 0 \tag{1.4} \]



for all $f$ belonging to a certain class of test functions, with $\Delta^+$ the forward
difference operator. Characterizations such as (1.4) are not as useful as the
more specific versions provided in [22] or [7], since the denominator in the
discrete score function $\Delta^+ p(x)/p(x)$ is likely to be difficult to handle. In [10], the
authors find a way to overcome this difficulty by considering pre-multiplication
of the test functions $f$ by some ad hoc function which provides the required
simplifications.

We will now show that an alternative solution is to re-interpret all these 
operators by identifying the minimal conditions on the class of test functions 
for which equations such as (1.4) hold. This will lead us to propose a different,
more general, operator which will, among other benefits, allow us to better handle
the above-described subtleties. More importantly, our approach from this section
will open connections with information-theoretic tools which we explore in the 
second part of the paper, i.e. Section 2. 



1.1. Framework, definitions and main result 

Let $\mathcal{G}$ be the collection of probability mass functions $p : \mathbb{Z} \to [0, 1]$ with support
$S_p := \{x \in \mathbb{Z} : p(x) > 0\}$ a discrete interval $[a, b] := \{a, a+1, \ldots, b\}$ for
$a < b \in \mathbb{Z} \cup \{\pm\infty\}$. We will, in the sequel, abuse language by referring
to probability mass functions as (discrete) densities. Throughout we adopt the
convention that sums running over empty sets equal 0, and that

\[ \frac{1}{p(x)} = \begin{cases} \dfrac{1}{p(x)} & \text{if } x \in S_p, \\ 0 & \text{otherwise.} \end{cases} \tag{1.5} \]

Note how, in particular, convention (1.5) implies that $p(x)/p(x) = \mathbb{I}_{S_p}(x)$, the
indicator of the support $S_p$. We will write $\mathrm{E}_p[l(X)] = \sum_{x \in S_p} l(x) p(x)$ for $p \in \mathcal{G}$
and $l$ a $p$-summable function. Furthermore we introduce the $\eta$-difference operator

\[ \Delta^\eta h(x) = \frac{1}{\eta}\left(h(x+\eta) - h(x)\right) \tag{1.6} \]

for all functions $h$ defined on $\mathbb{Z}$. Operator (1.6) contains, in particular,
the more standard forward difference operator (with $\eta = 1$) and backward
difference operator (with $\eta = -1$).

With these notations and conventions, we are able to state the two main 
definitions of this section and to establish our first main result. 
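
Definition 1.1. To $p \in \mathcal{G}$ and $\eta \in \mathbb{Z}$ we associate (i) the collection $\mathcal{F}^\eta(p)$ of
functions $f : \mathbb{Z} \to \mathbb{R}$ such that

\[ \sum_{j=a}^{b} \Delta^\eta\left(f(j) p(j)\right) = 0, \]

and (ii) the operator $\mathcal{T}_p^\eta : \mathcal{F}^\eta(p) \to \mathbb{Z}^{*} : f \mapsto \mathcal{T}_p^\eta f$ defined through

\[ \mathcal{T}_p^\eta f : \mathbb{Z} \to \mathbb{R} : x \mapsto \mathcal{T}_p^\eta f(x) := \frac{1}{p(x)} \Delta^\eta\left(f(x) p(x)\right). \tag{1.7} \]

We call $\mathcal{F}^\eta(p)$ the class of $\eta$-test functions associated with $p$, and $\mathcal{T}_p^\eta$ the $\eta$-Stein
operator associated with $p$.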
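
As a minimal numerical sketch of Definition 1.1 (not part of the original argument), the following Python code implements the difference operator (1.6) and the Stein operator (1.7) for a density stored as a dictionary. The names `diff_eta` and `stein_operator`, the Binomial(3, 1/2) density and the test function f(x) = x are illustrative assumptions.

```python
from fractions import Fraction

def diff_eta(h, eta):
    """Return x -> (h(x + eta) - h(x)) / eta, as in (1.6)."""
    return lambda x: Fraction(h(x + eta) - h(x)) / eta

def stein_operator(f, p, eta):
    """Return x -> Delta^eta(f(x) p(x)) / p(x), using convention (1.5) off the support."""
    def fp(x):
        return f(x) * p.get(x, Fraction(0))
    d = diff_eta(fp, eta)
    def Tf(x):
        px = p.get(x, Fraction(0))
        return d(x) / px if px > 0 else Fraction(0)
    return Tf

# Hypothetical example: Binomial(3, 1/2) on {0, 1, 2, 3}.
p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
f = lambda x: Fraction(x)                 # f(0) p(0) = 0, so f is a forward test function
Tf = stein_operator(f, p, eta=1)
mean_Tf = sum(Tf(x) * p[x] for x in p)    # E_p[T_p^+ f(X)], expected to vanish
```

Exact rational arithmetic is used so that the vanishing of the expectation can be checked without rounding error.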

Although Definition 1.1 is valid for any $\eta \in \mathbb{Z}$ (with, in certain instances, an
empty class of test functions) and although it is in principle straightforward to
adapt the forthcoming arguments to various choices of $\eta$, such manipulations
seem to us unnecessarily ambitious because they bring, to the best of our
understanding, little insight into the mechanics behind our results; we will therefore
henceforth restrict our attention to $|\eta| = 1$.

Theorem 1.1 (Discrete density approach). Fix $|\eta| = 1$ and let $X$ be a discrete
random variable with density $p \in \mathcal{G}$. Let $Y$ be another discrete random variable
with density $q$. Then $\mathrm{E}_q[\mathcal{T}_p^\eta f(Y)] = 0$ for all $f \in \mathcal{F}^\eta(p)$ if, and only if, either
$P(Y \in S_p) = 0$ or $P(Y \in S_p) > 0$ and $P(Y \le z \mid Y \in S_p) = P(X \le z)$ for all
$z \in S_p$.

Proof. If $P(Y \in S_p) = 0$, the equivalence holds trivially so that we can take
$P(Y \in S_p) > 0$. We first check sufficiency. The equality $P(Y \le z \mid Y \in S_p) =
P(X \le z)$ for all $z \in S_p$ can be rewritten as $P(Y = z) = P(X = z) P(Y \in S_p)$,
hence as $q(z) = p(z) P(Y \in S_p)$, for all $z \in S_p$. Bearing in mind that
$\mathcal{T}_p^\eta f(x) = 0$ for all $x \notin S_p$, the sufficiency is easily established through

\[ \mathrm{E}_q\left[\mathcal{T}_p^\eta f(Y)\right] = P(Y \in S_p) \sum_{x=a}^{b} \Delta^\eta\left(f(x) p(x)\right) = 0, \]

the last equality following by definition of the class $\mathcal{F}^\eta(p)$. Next, to see the
necessity, define, for $z \in \mathbb{Z}$, the functions
$l_z(k) := \left(\mathbb{I}_{(-\infty, z] \cap \mathbb{Z}}(k) - P(X \le z)\right) \mathbb{I}_{S_p}(k)$
for $k \in \mathbb{Z}$ and define

\[ f_z^{p,+} : \mathbb{Z} \to \mathbb{R} : x \mapsto \frac{1}{p(x)} \sum_{k=a}^{x-1} l_z(k) p(k) \tag{1.8} \]

and

\[ f_z^{p,-} : \mathbb{Z} \to \mathbb{R} : x \mapsto \frac{1}{p(x)} \sum_{k=a}^{x} l_z(k) p(k). \tag{1.9} \]

Clearly these functions satisfy $\Delta^\eta\left(f_z^{p,\eta}(x) p(x)\right) = l_z(x) p(x)$ so that, in
particular, $f_z^{p,\eta} \in \mathcal{F}^\eta(p)$ and

\[ \mathcal{T}_p^\eta f_z^{p,\eta}(x) = l_z(x) \tag{1.10} \]

for all $x \in S_p$. Consequently, for this choice of test function we obtain

\begin{align*}
\sum_{x \in S_p} \mathcal{T}_p^\eta f_z^{p,\eta}(x) q(x) &= \sum_{x \in S_p} l_z(x) q(x) \\
&= P(Y \le z \cap Y \in S_p) - P(Y \in S_p) P(X \le z),
\end{align*}

which, in combination with the hypothesis $\mathrm{E}_q[\mathcal{T}_p^\eta f_z^{p,\eta}(Y)] = 0$, finally yields
$P(Y \le z \mid Y \in S_p) = P(X \le z)$ for all $z \in S_p$, whence the claim. □
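
The construction used in the necessity part can be checked numerically. The sketch below builds the forward solution (1.8) for a hypothetical Binomial(3, 1/2) target and verifies the Stein equation (1.10) at every point of the support; the helper names `cdf`, `l`, `f_plus` and `T_plus` are assumptions made for this illustration only.

```python
from fractions import Fraction

# Hypothetical target: Binomial(3, 1/2), support [a, b] = [0, 3].
p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
a, b = 0, 3

def cdf(z):
    return sum(p[k] for k in p if k <= z)

def l(z, k):
    """l_z(k) = (1{k <= z} - P(X <= z)) 1{k in S_p}."""
    return (Fraction(int(k <= z)) - cdf(z)) if k in p else Fraction(0)

def f_plus(z, x):
    """Forward solution (1.8): (1/p(x)) * sum_{k=a}^{x-1} l_z(k) p(k); zero off the support."""
    if x not in p:
        return Fraction(0)
    return sum((l(z, k) * p[k] for k in range(a, x)), Fraction(0)) / p[x]

def T_plus(z, x):
    """Forward Stein operator (1.7) applied to f_plus(z, .)."""
    if x not in p:
        return Fraction(0)
    nxt = f_plus(z, x + 1) * p.get(x + 1, Fraction(0))
    return (nxt - f_plus(z, x) * p[x]) / p[x]

# (1.10): T_p^+ f_z^{p,+}(x) = l_z(x) for all x in S_p and all z in S_p.
checks = [T_plus(z, x) == l(z, x) for z in p for x in p]
```

At the right end point x = b the identity relies on the fact that the terms l_z(k) p(k) sum to zero over the support, which is exactly the membership condition for the class of forward test functions.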

Remark 1.1. There are many alternative ways of proving Theorem 1.1; among 
the more elegant are those which rely on uniqueness properties of the operators (1.7)
w.r.t. integration in p, as for instance in the very general setting proposed in
[17]. We nevertheless prefer the route presented above because it also allowed us
to introduce several staples from Stein's method: the Stein equations (1.10) and
their solutions (1.8) and (1.9). Both concepts will play an important role in the 
sequel. 

Remark 1.2. Note that the choice of a "connected" support $S_p$ is for convenience
only, and straightforward arguments allow one to adapt the result, for instance,
to supports of the form $[a, b] \cup [c, d]$ with $c > b$.

Remark 1.3. Expanding the forward difference, one sees that the operator (1.7) 
yields the same expression as [10, Equation (8)]. Our density approach and 
theirs are not equivalent, as described in Remark 2.1 of [10] (although their 
comparison refers to an older version of the present paper). The differences 
between their assumptions and ours are due to the "difference of a product" 
structure of our operator (1.7). 



1.2. Examples

Theorem 1.1 extends and unifies many corresponding results from the literature, 
as will be shown through the following examples. 

Take $p(x) = p_\lambda(x)$ the density of a mean-$\lambda$ Poisson random variable. Then,
by definition, the class $\mathcal{F}^+(p) =: \mathcal{F}^+(\lambda)$ is composed of all functions $f : \mathbb{Z} \to \mathbb{R}$
such that (i) $x \mapsto \Delta^+(f(x) p_\lambda(x))$ is summable over $\mathbb{N}$ and (ii) $f(0) p_\lambda(0) =
\lim_{x \to \infty} f(x) p_\lambda(x)$ (which in most cases equals 0). In particular, $\mathcal{F}^+(\lambda)$
contains the set of bounded functions $f$ such that $f(0) = 0$, for which simple
computations show that

\[ \mathcal{T}_\lambda^+ f(x) = \left(\frac{\lambda}{x+1} f(x+1) - f(x)\right) \mathbb{I}_{\mathbb{N}}(x). \]

This operator coincides with that discussed in [10, page 6]. One could also
consider only functions of the form $f(x) = x f_0(x)$ for $f_0$ such that $x \mapsto x f_0(x) \in
\mathcal{F}^+(\lambda)$, in which case no restriction on $f_0(0)$ (other than that it be finite) is
necessary to ensure the required border behavior. Plugging such functions into
(1.7) and simplifying accordingly we obtain

\[ \mathcal{T}_\lambda^+ f(x) := \left(\lambda f_0(x+1) - x f_0(x)\right) \mathbb{I}_{\mathbb{N}}(x), \tag{1.11} \]

which is none other than the standard operator given in (1.3). Most authors
refer to (1.11) as the Stein operator for the Poisson distribution although there
are, of course, many more operators for this distribution which can be obtained
from (1.7). One can, for instance, change the parameterization of the class $\mathcal{F}^+(\lambda)$
through "pre-multiplication" of the form $f(x) = c(x) f_0(x)$. See [10] for more on
this approach. Another way of constructing Stein operators is by making use of
the backward difference, for which the class $\mathcal{F}^-(p) =: \mathcal{F}^-(\lambda)$ is composed of all
functions $f : \mathbb{Z} \to \mathbb{R}$ such that (i) $x \mapsto \Delta^-(f(x) p_\lambda(x))$ is summable over $\mathbb{N}$ and
(ii) $\lim_{x \to \infty} f(x) p_\lambda(x) = 0$. Here no border condition at 0 is necessary. For such
$f$ the operator becomes, after simplification,

\[ \mathcal{T}_\lambda^- f(x) = \left(f(x) - \frac{x}{\lambda} f(x-1)\right) \mathbb{I}_{\mathbb{N}}(x), \]

which is, up to a scaling and a shift, equivalent to the standard operator (1.11).
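
The Poisson operators above lend themselves to a quick floating-point sanity check. In the sketch below the rate `lam = 2.5`, the bounded function `f0(x) = 1/(2 + x)` and the truncation of the sums at 200 terms are all assumptions chosen for illustration; the code checks that the forward operator applied to f(x) = x f0(x) agrees with the standard form (1.11) and that both the forward and the backward operators have (numerically) zero mean under the Poisson law.

```python
import math

lam = 2.5                                   # hypothetical Poisson mean

def pois(x):
    """Poisson(lam) probability mass at x >= 0, computed on the log scale."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

f0 = lambda x: 1.0 / (2.0 + x)              # hypothetical bounded function
f = lambda x: x * f0(x)                     # f(0) = 0 and f is bounded

def T_plus(x):
    """Forward operator: (lam / (x + 1)) f(x + 1) - f(x)."""
    return lam / (x + 1) * f(x + 1) - f(x)

def T_std(x):
    """Standard form (1.11): lam f0(x + 1) - x f0(x)."""
    return lam * f0(x + 1) - x * f0(x)

def T_minus(x):
    """Backward operator: f(x) - (x / lam) f(x - 1)."""
    return f(x) - (x / lam) * f(x - 1)

xs = range(0, 200)                          # truncation; the neglected tail is negligible here
mean_plus = sum(T_plus(x) * pois(x) for x in xs)
mean_minus = sum(T_minus(x) * pois(x) for x in xs)
max_gap = max(abs(T_plus(x) - T_std(x)) for x in xs)
```

The three quantities `mean_plus`, `mean_minus` and `max_gap` should all be zero up to rounding error.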

Next let $p$ be the density of $S_n$, the number of white balls added to the Pólya-
Eggenberger urn by time $n$, with initial state $\alpha \ge 1$ white and $\beta \ge 1$ black balls.
We know, e.g. from [10], that

\[ p(k) = P(S_n = k) = \binom{n}{k} \frac{(\alpha)_k (\beta)_{n-k}}{(\alpha+\beta)_n} \]

for $k = 0, \ldots, n$, with $(x)_0 = 1$ and otherwise $(x)_k = x(x+1) \cdots (x+k-1)$
the rising factorial. Writing out the classes $\mathcal{F}^\eta(p)$ and the operators (1.7) in all
generality for these distributions is of little practical or theoretical interest; in
particular the resulting objects are hard to manipulate (see the discussion in
[10]). It is much more informative to directly restrict one's attention to specific
subclasses. For instance it is easy to see that $\mathcal{F}^+(p) =: \mathcal{F}^+(\alpha, \beta)$ contains all
functions of the form $f(x) = x f_0(x)$ with $f_0$ bounded and, for these $f$, the
operator is of the form

\[ \mathcal{T}^+_{(\alpha,\beta)} f(x) = \left(f_0(x+1) \frac{(n-x)(\alpha+x)}{\beta+n-x-1} - x f_0(x)\right) \mathbb{I}_{[0,n]}(x). \]

Likewise $\mathcal{F}^-(p) =: \mathcal{F}^-(\alpha, \beta)$ contains all functions of the form $f(x) = (n-x) f_0(x)$
with $f_0$ bounded and, for these $f$, the operator is of the form

\[ \mathcal{T}^-_{(\alpha,\beta)} f(x) = \left((n-x) f_0(x) - f_0(x-1) \frac{x(\beta+n-x)}{\alpha+x-1}\right) \mathbb{I}_{[0,n]}(x). \]

Of course many variations on the above are imaginable. For instance one could
also choose to consider functions of the form $f(x) = x(\beta+n-x) f_0(x)$; plugging
these into (1.7) yields the operator discussed in [10, equation 7].
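
The following sketch checks, in exact arithmetic, that the two Pólya-Eggenberger operators have zero mean under the urn distribution. The forward operator is derived here from (1.7) and the ratio p(x+1)/p(x) of the mass function above; the parameters `n = 5`, `alpha = 2`, `beta = 3` and the bounded function `f0` are hypothetical choices (taken so that no denominator vanishes).

```python
from fractions import Fraction
from math import comb

def rising(x, k):
    """Rising factorial (x)_k = x (x + 1) ... (x + k - 1), with (x)_0 = 1."""
    out = Fraction(1)
    for i in range(k):
        out *= x + i
    return out

n, alpha, beta = 5, 2, 3                    # hypothetical urn parameters
p = {k: comb(n, k) * rising(alpha, k) * rising(beta, n - k) / rising(alpha + beta, n)
     for k in range(n + 1)}

f0 = lambda x: Fraction(1, 1 + x * x)       # hypothetical bounded function on the integers

def T_plus(x):
    """Forward operator for f(x) = x f0(x)."""
    return f0(x + 1) * Fraction((n - x) * (alpha + x), beta + n - x - 1) - x * f0(x)

def T_minus(x):
    """Backward operator for f(x) = (n - x) f0(x)."""
    return (n - x) * f0(x) - f0(x - 1) * Fraction(x * (beta + n - x), alpha + x - 1)

total = sum(p.values())                          # the mass function sums to one
mean_plus = sum(T_plus(x) * p[x] for x in p)     # expected to vanish
mean_minus = sum(T_minus(x) * p[x] for x in p)   # expected to vanish
```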

Thirdly we consider $p$ belonging to the Ord family of distributions, that is
we suppose that there exist $s(x)$ and $\tau(x)$ such that

\[ \frac{p(x+1)}{p(x)} = \frac{s(x) + \tau(x)}{s(x+1)} \]

with $s(a) = 0$ (if $a$ is finite) and $s(x) > 0$ for $a < x \le b$. For an explanation of
these notations see [22, equations (11) and (12)]. Writing out the classes $\mathcal{F}^\eta(p)$
and the operators (1.7) in all generality is again of little practical or theoretical
interest. Note however that $\mathcal{F}^+(p) =: \mathcal{F}^+(s, \tau)$ contains all functions $f : \mathbb{Z} \to \mathbb{R}$
which are of the form $f(x) = f_0(x) s(x)$ with $f_0$ some bounded function. For
these $f$, the operator writes out

\[ \mathcal{T}^+_{(s,\tau)} f(x) = \left((s(x) + \tau(x)) f_0(x+1) - s(x) f_0(x)\right) \mathbb{I}_{[a,b]}(x), \]

and we retrieve the operator presented in [22]. Similarly for the backward operator
we see that $\mathcal{F}^-(p) =: \mathcal{F}^-(s, \tau)$ contains all functions $f : \mathbb{Z} \to \mathbb{R}$ such that
(i) $x \mapsto f(x) p(x)$ is bounded over $S_p$ and (ii) $\lim_{x \to b} f(x) p(x) = 0$. For these
$f$, the operator writes out

\[ \mathcal{T}^-_{(s,\tau)} f(x) = \left(f(x) - \frac{s(x)}{s(x-1) + \tau(x-1)} f(x-1)\right) \mathbb{I}_{[a,b]}(x). \]

There are, of course, many variations on the approaches presented above. 

Finally choose $p$ with support $[0, N]$ for some $N > 0$ and represent it as a
Gibbs measure, that is, write

\[ p(x) = \frac{1}{\mathcal{Z}} e^{V(x)} \frac{\omega^x}{x!} \mathbb{I}_{[0,N]}(x) \]

with $N$ some positive integer, $\omega > 0$ fixed, $V$ a function mapping $[0, N]$ to $\mathbb{R}$
and $V(k) = -\infty$ for $k > N$, and $\mathcal{Z}$ the normalizing constant. This is always
possible, although there is no unique choice of representation (see [7]). Then
$\mathcal{F}^\eta(p) =: \mathcal{F}^\eta(V, \omega)$ is composed of all functions $f : \mathbb{Z} \to \mathbb{R}$ which satisfy the
summability requirements and such that either $f(0) p(0) = 0$ (if $\eta = 1$) or
$f(N) p(N) = 0$ (if $\eta = -1$). In particular, $\mathcal{F}^+(V, \omega)$ contains functions of the
form $f(x) = x f_0(x)$ with $f_0$ bounded and, for these $f$, the operator is of the
form

\[ \mathcal{T}^+_{(V,\omega)} f(x) = \left(e^{V(x+1) - V(x)} \omega f_0(x+1) - x f_0(x)\right) \mathbb{I}_{[0,N]}(x); \tag{1.12} \]

this corresponds to the Stein operator presented in [7]. Likewise if $N < \infty$ then
$\mathcal{F}^-(V, \omega)$ contains functions of the form $f(x) = (N-x) f_0(x)$ with $f_0$ bounded
and, for these $f$, the operator is of the form

\[ \mathcal{T}^-_{(V,\omega)} f(x) = \left(f_0(x)(N-x) - e^{V(x-1) - V(x)} \frac{x(N-x+1)}{\omega} f_0(x-1)\right) \mathbb{I}_{[0,N]}(x) \]

and, if $N = \infty$, then $f(x) = f_0(x)$ with $f_0$ bounded suffices and the operator
is equivalent to (1.12). Again a number of other parameterizations of the class
$\mathcal{F}^\eta(V, \omega)$ can be considered, each leading to an alternative form of operator.

2. Stein-type identities and information distances



A classical information-theoretic tool is the relative entropy (a.k.a. Kullback- 
Leibler divergence) 



\[ d_{KL}(p \,\|\, q) = \mathrm{E}_q\left[\log\left(\frac{q(X)}{p(X)}\right)\right] \tag{2.1} \]


which allows for discriminating between any two densities $p$ and $q$ that satisfy
mild support and differentiability (summability) constraints. Letting $d_{TV}(p, q)$
stand for the total variation distance between $p$ and $q$ (a precise definition is
given in Section 2.2), the famous Pinsker inequality

\[ d_{TV}(p, q) \le \frac{1}{\sqrt{2}} \sqrt{d_{KL}(p \,\|\, q)} \tag{2.2} \]



moreover shows that convergence in relative entropy implies other weaker forms 
of convergence (see, e.g., [9]). 
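
As a small numerical illustration of Pinsker's inequality (2.2), the sketch below compares the total variation distance with the relative entropy bound for two hypothetical densities on {0, 1, 2, 3}. Since the total variation distance is symmetric, the bound is checked with the relative entropy computed in both orders; the function names are assumptions made for this example.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two mass functions given as dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def relative_entropy(q, p):
    """sum_x q(x) log(q(x) / p(x)); requires the support of q to lie in that of p."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

# Hypothetical densities sharing the support {0, 1, 2, 3}.
p = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
q = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

tv = total_variation(p, q)
bound_qp = math.sqrt(relative_entropy(q, p) / 2)
bound_pq = math.sqrt(relative_entropy(p, q) / 2)
```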

For absolutely continuous distributions with differentiable density, there exists
a local version of (2.1) known as the Fisher information distance

\[ \mathcal{J}_{\mathcal{N}}(Y) = \mathrm{E}\left[\left(\frac{p'(Y)}{p(Y)} + Y\right)^2\right] \tag{2.3} \]



whose properties allow for obtaining elegant optimal-order Gaussian approxi- 
mation bounds. The Fisher information distance (2.3) is a local version of the 
relative entropy in the sense of De Bruijn's identity (see [4]). This (pseudo-) dis- 
tance (i) measures the discrepancy between an arbitrary distribution p and the 
Gaussian distribution N, (ii) enjoys a useful sub-additivity property on convo- 
lutions and (iii) dominates the total variation distance in a sense similar to (2.2) 
(see, e.g., [4, 13, 14]). In [18] we used an extension of Stein's density approach 
for continuous distributions to provide a natural generalization of (2.3) which 
is applicable to any pair of densities $(p, q)$; we also showed that this generalized
Fisher information distance satisfies a "local" version of Pinsker's inequality 
in the sense that it also dominates the total variation distance with explicit 
constant not depending on the choice of p and q. 

For discrete distributions there also exist at least two local versions of (2.1)
which have been put to use in the literature on Poisson convergence, namely
the discrete Fisher information

\[ \mathcal{J}(Po(\lambda), q) := \mathrm{E}_q\left[\left(\frac{\lambda q(X-1)}{q(X)} - X\right)^2\right] \tag{2.4} \]

introduced in [3] as a generalization of an information functional presented in
[15] and the scaled Fisher information

\[ \mathcal{K}(Po(\lambda), q) := \lambda \mathrm{E}_q\left[\left(\frac{(X+1) q(X+1)}{\lambda q(X)} - 1\right)^2\right] \tag{2.5} \]

introduced in [16]. Similarly as for the Fisher information distance, both the
above (pseudo-)distances (i) measure the discrepancy between the Poisson($\lambda$)
distribution and a given density $q$ defined on $\mathbb{N}$, (ii) enjoy a useful sub-additivity
property on sums of random variables, and (iii) dominate the total variation 
distance (see, e.g., [3, 11, 15, 16, 20, 21]). 

In the sequel we will define general discrete Fisher information distances 
which naturally extend (2.4) and (2.5) and allow for the comparison of an arbi- 
trary pair of distributions (p, q). We will, moreover, show that these generalized 
discrete Fisher information distances satisfy a "local" version of Pinsker's in- 
equality. 



2.1. A Stein-type identity 



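We choose to fix, for the sake of convenience, $S_p = [0, \ldots, M]$ and $S_q =
[0, \ldots, N]$, for some integers $0 < N \le M \le \infty$. Now suppose that $\mathcal{F}^\eta(p) \cap
\mathcal{F}^\eta(q) \neq \emptyset$ and choose some $f$ in this intersection. Then, for this $f$, we can write

\begin{align}
\mathcal{T}_p^\eta f(x) &= \mathcal{T}_q^\eta f(x) + \mathcal{T}_p^\eta f(x) - \mathcal{T}_q^\eta f(x) \tag{2.6} \\
&= \mathcal{T}_q^\eta f(x) + \left(\frac{\Delta^\eta(f(x) p(x))}{p(x)} - \frac{\Delta^\eta(f(x) q(x))}{q(x)}\right) \notag \\
&= \mathcal{T}_q^\eta f(x) + f(x+\eta) \frac{1}{\eta}\left(\frac{p(x+\eta)}{p(x)} - \frac{q(x+\eta)}{q(x)}\right) - \frac{1}{\eta} f(x) \mathbb{I}_{[N+1, \ldots, M]}(x). \tag{2.7}
\end{align}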

Next let $l : \mathbb{Z} \to \mathbb{R}$ be a function such that both $\mathrm{E}_p[l(X)]$ and $\mathrm{E}_q[l(X)]$ exist
and consider the solution $f_l^{p,\eta}$ of the difference (Stein) equation

\[ \mathcal{T}_p^\eta f(x) = l(x) - \mathrm{E}_p[l(X)]. \tag{2.8} \]

As in the proof of Theorem 1.1 (see identities (1.8) and (1.9)) it is easy to show
that the solutions to (2.8) are given by

\[ f_l^{p,+} : \mathbb{Z} \to \mathbb{R} : x \mapsto \sum_{k=0}^{x-1} \left(l(k) - \mathrm{E}_p[l(X)]\right) \frac{p(k)}{p(x)} \tag{2.9} \]

for the forward difference (recall that empty sums are set to 0) and

\[ f_l^{p,-} : \mathbb{Z} \to \mathbb{R} : x \mapsto \sum_{k=0}^{x} \left(l(k) - \mathrm{E}_p[l(X)]\right) \frac{p(k)}{p(x)} \tag{2.10} \]

for the backward difference operator. The functions $f_l^{p,\eta}$ as defined above trivially
belong to $\mathcal{F}^\eta(p)$. To proceed we need the following assumption.

Assumption A: The distributions $p$ and $q$ are such that the solutions $f_l^{p,\eta}$ of
the Stein equation (2.8) satisfy $f_l^{p,\eta} \in \mathcal{F}^\eta(p) \cap \mathcal{F}^\eta(q)$ for $|\eta| = 1$.
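
For any given target $p$ it is easy to determine conditions on $q$ and $l$ for
Assumption A to be satisfied. These conditions are not restrictive. Under Assumption A
we can take expectations with respect to $q$ in (2.6) to obtain from equation (2.8)
that

\begin{align}
\mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)] &= \mathrm{E}_q\left[\mathcal{T}_q^\eta f_l^{p,\eta}(X)\right] + \mathrm{E}_q\left[f_l^{p,\eta}(X+\eta)\, r^\eta(p,q)(X)\right] \notag \\
&\quad - \frac{1}{\eta} \mathrm{E}_q\left[f_l^{p,\eta}(X) \mathbb{I}_{[N+1, \ldots, M]}(X)\right], \tag{2.11}
\end{align}

where we define

\[ r^\eta(p,q)(x) := \frac{1}{\eta}\left(\frac{p(x+\eta)}{p(x)} - \frac{q(x+\eta)}{q(x)}\right). \tag{2.12} \]

Since $S_q \cap [N+1, \ldots, M] = \emptyset$, the last term in (2.11) vanishes; also we have
$\mathrm{E}_q[\mathcal{T}_q^\eta f_l^{p,\eta}(X)] = 0$ through Theorem 1.1. This yields the following result.

Lemma 2.1. Take $p, q \in \mathcal{G}$ with $S_q \subseteq S_p$ and $l : \mathbb{Z} \to \mathbb{R}$ a function such that
$\mathrm{E}_p[l(X)]$ and $\mathrm{E}_q[l(X)]$ exist. Suppose moreover that Assumption A holds. Then

\[ \mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)] = \mathrm{E}_q\left[f_l^{p,\eta}(X+\eta)\, r^\eta(p,q)(X)\right]. \tag{2.13} \]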
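
The Stein identity (2.13) can be verified exactly on a small example. In the sketch below, `p` and `q` are two hypothetical densities with common support {0, 1, 2, 3} (so that Assumption A holds), `l(x) = x^2` is an arbitrary test function, and the helper names `mean`, `ratio`, `f_sol` and `r` are assumptions introduced for the illustration. Both the forward (eta = 1) and the backward (eta = -1) versions are checked.

```python
from fractions import Fraction

p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
q = {0: Fraction(2, 5), 1: Fraction(3, 10), 2: Fraction(1, 5), 3: Fraction(1, 10)}
l = lambda x: Fraction(x * x)               # hypothetical test function

def mean(d, g):
    """Expectation of g under the mass function d."""
    return sum(g(x) * d[x] for x in d)

def ratio(d, x, eta):
    """d(x + eta) / d(x), with convention (1.5) off the support."""
    return d.get(x + eta, Fraction(0)) / d[x] if x in d else Fraction(0)

def f_sol(x, eta):
    """Solutions (2.9) (eta = 1) and (2.10) (eta = -1) of the Stein equation (2.8)."""
    if x not in p:
        return Fraction(0)
    upper = x if eta == 1 else x + 1        # sum over k = 0..x-1 or k = 0..x
    s = sum(((l(k) - mean(p, l)) * p[k] for k in range(0, upper)), Fraction(0))
    return s / p[x]

def r(x, eta):
    """Score-type function (2.12)."""
    return (ratio(p, x, eta) - ratio(q, x, eta)) / eta

lhs = mean(q, l) - mean(p, l)
rhs = {eta: mean(q, lambda x: f_sol(x + eta, eta) * r(x, eta)) for eta in (1, -1)}
```

Here `lhs` equals -1, and both entries of `rhs` reproduce that value.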

Following the terminology from [1, 2, 10] we call (2.13) a Stein (or Stein-
type) identity. Similarly as its counterpart [18, Lemma 3.2] in the absolutely
continuous setting, Lemma 2.1 provides the connection between our version 
of the discrete density approach from Theorem 1.1 and discrete information 
inequalities. 



2.2. Local versions of Pinsker's inequality 



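A wide variety of probability metrics can be written under the form

\[ d_{\mathcal{H}}(p, q) = \sup_{l \in \mathcal{H}} \left|\mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)]\right| \tag{2.14} \]

for some class of functions $\mathcal{H}$. In particular the total variation distance satisfies

\[ d_{TV}(p, q) = \frac{1}{2} \sum_{x \in \mathbb{Z}} |p(x) - q(x)| = \sup_{l \in \{h\}} \left|\mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)]\right|, \tag{2.15} \]

where the supremum in the second equality is taken over a set containing one
single function $h(x) := \frac{1}{2}\left(\mathbb{I}_{[p(x) \le q(x)]} - \mathbb{I}_{[p(x) > q(x)]}\right) = \mathbb{I}_{[p(x) \le q(x)]} - \frac{1}{2}$. Other
distances such as the Kolmogorov, the Wasserstein, the supremum-distance or
the $L^1$-distance are also of the form (2.14); we refer the reader to [9] for an
overview.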
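
The representation of the total variation distance through the single function h can be checked directly. The sketch below, for two hypothetical densities on {0, 1, 2, 3}, compares half the L1 distance with the difference of the expectations of h under q and p.

```python
from fractions import Fraction

# Hypothetical densities with common support {0, 1, 2, 3}.
p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
q = {0: Fraction(2, 5), 1: Fraction(3, 10), 2: Fraction(1, 5), 3: Fraction(1, 10)}

# h(x) = 1{p(x) <= q(x)} - 1/2
h = lambda x: Fraction(int(p[x] <= q[x])) - Fraction(1, 2)

tv_l1 = Fraction(1, 2) * sum(abs(p[x] - q[x]) for x in p)
tv_h = abs(sum(h(x) * (q[x] - p[x]) for x in p))   # |E_q[h(X)] - E_p[h(X)]|
```

Both `tv_l1` and `tv_h` equal 11/40 for this pair of densities.
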

In view of (2.14), it is natural to take suprema on either side of (2.13) to
deduce that, whenever Assumption A is satisfied, we have

\[ d_{\mathcal{H}}(p, q) = \sup_{l \in \mathcal{H}} \left|\mathrm{E}_q\left[f_l^{p,\eta}(X+\eta)\, r^\eta(p,q)(X)\right]\right|. \tag{2.16} \]

Equation (2.16) is a very powerful identity. Indeed, roughly speaking, it allows
one to deduce (simply through Cauchy-Schwarz) a wide variety of information
inequalities of the form

\[ d_{\mathcal{H}}(p, q) \le \kappa_{\mathcal{H}}^{p,\eta} \sqrt{\mathcal{J}^\eta(p, q)}, \]

where $\mathcal{J}^\eta(p, q)$ is the second moment of some function proportional to the score
function $r^\eta(p, q)$ and $\kappa_{\mathcal{H}}^{p,\eta}$ some finite constant that can be bounded by a value
depending only on the properties of $p$ (but not on $q$) as well as on the choice of
class $\mathcal{H}$ (and thus of distance $d_{\mathcal{H}}$), see, e.g., (2.20) below for $\eta = -1$.

More precisely, we will use (2.16) to identify natural discrete information 
distances which uniformly dominate all probability metrics of the form (2.14) 
through an inequality in which only the constant is distance-dependent. Speci- 
fying the class as in (2.15) we readily see that the resulting inequalities are valid 
for virtually any choice (p, q) ; we therefore contend that their scope is compa- 
rable with that of Pinsker's inequality (2.2), this time for local versions of the 
(discrete) Kullback-Leibler divergence (2.1). 

To illustrate the announcements of the previous paragraphs, we start with
the backward difference operator obtained for $\eta = -1$. Identity (2.13) spells out
as

\[ \mathrm{E}_q[l(X)] - \mathrm{E}_p[l(X)] = \mathrm{E}_q\left[f_l^{p,-}(X-1)\, r^-(p,q)(X)\right] \tag{2.17} \]

with $r^-(p,q)(x) = \frac{q(x-1)}{q(x)} - \frac{p(x-1)}{p(x)}$ and with $f_l^{p,-}$ as in (2.10). Taking suprema
on either side of (2.17) and applying Cauchy-Schwarz we obtain the following.

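Theorem 2.1. Take $p, q \in \mathcal{G}$ with $S_q \subseteq S_p$ and such that $\mathcal{F}^-(p) \cap \mathcal{F}^-(q) \neq \emptyset$.
Let $d_{\mathcal{H}}(p, q)$ be defined as in (2.14) for some class of functions $\mathcal{H}$, and suppose
that for all $l \in \mathcal{H}$ the function $f_l^{p,-}$ defined in (2.10) exists and satisfies $f_l^{p,-} \in
\mathcal{F}^-(p) \cap \mathcal{F}^-(q)$. Then

\[ d_{\mathcal{H}}(p, q) \le \kappa_{\mathcal{H}}^{p,-} \sqrt{\mathcal{J}_{gen}(p, q)}, \]

where

\[ \mathcal{J}_{gen}(p, q) := \mathrm{E}_q\left[\left(\frac{q(X-1)}{q(X)} - \frac{p(X-1)}{p(X)}\right)^2\right] \tag{2.18} \]

is the generalized discrete Fisher information distance between the densities $p$
and $q$, and

\[ \kappa_{\mathcal{H}}^{p,-} := \sup_{l \in \mathcal{H}} \sqrt{\mathrm{E}_q\left[\left(f_l^{p,-}(X-1)\right)^2\right]}. \]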
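
A minimal numerical sketch of Theorem 2.1 for the total variation distance follows. The two densities are hypothetical and share the support {0, 1, 2, 3}; the class of test functions is reduced to the single function h(x) = 1{p(x) <= q(x)} - 1/2, and the names `f_minus`, `r_minus`, `kappa_sq` are assumptions made for the example. The code checks both the backward Stein identity (2.17) and the resulting Cauchy-Schwarz bound.

```python
import math
from fractions import Fraction

p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
q = {0: Fraction(2, 5), 1: Fraction(3, 10), 2: Fraction(1, 5), 3: Fraction(1, 10)}

h = lambda x: Fraction(int(p[x] <= q[x])) - Fraction(1, 2)
Eph = sum(h(x) * p[x] for x in p)           # E_p[h(X)]

def f_minus(x):
    """Backward solution (2.10) for l = h; zero off the support."""
    if x not in p:
        return Fraction(0)
    return sum(((h(k) - Eph) * p[k] for k in range(0, x + 1)), Fraction(0)) / p[x]

def r_minus(x):
    """r^-(p, q)(x) = q(x - 1)/q(x) - p(x - 1)/p(x)."""
    prev = lambda d: d.get(x - 1, Fraction(0)) / d[x]
    return prev(q) - prev(p)

d_tv = Fraction(1, 2) * sum(abs(p[x] - q[x]) for x in p)
stein_side = sum(f_minus(x - 1) * r_minus(x) * q[x] for x in q)   # right side of (2.17)
J_gen = sum(r_minus(x) ** 2 * q[x] for x in q)                    # (2.18)
kappa_sq = sum(f_minus(x - 1) ** 2 * q[x] for x in q)             # squared constant
bound = math.sqrt(kappa_sq * J_gen)
```
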

As an application suppose that $p$ and $q$ share the same support. Then we can
write

\[ \frac{q(x-1)}{q(x)} - \frac{p(x-1)}{p(x)} = \frac{\Delta^- p(x)}{p(x)} - \frac{\Delta^- q(x)}{q(x)} \]

so that (2.18) becomes

\[ \mathcal{J}_{gen}(p, q) = \mathrm{E}_q\left[\left(\frac{\Delta^- p(X)}{p(X)} - \frac{\Delta^- q(X)}{q(X)}\right)^2\right]. \tag{2.19} \]

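The distance (2.19) extends the Fisher information distance (2.4) to the comparison
of any pair of densities $p, q$. Taking, in particular, a Poisson target we
retrieve

\[ \mathcal{J}_{gen}(Po(\lambda), q) = \mathrm{E}_q\left[\left(1 - \frac{X}{\lambda} - \frac{\Delta^- q(X)}{q(X)}\right)^2\right] = \frac{1}{\lambda^2} \mathrm{E}_q\left[\left(\frac{\lambda q(X-1)}{q(X)} - X\right)^2\right], \]

which in turn can be expressed as $\frac{\sigma^2}{\lambda^2} - \frac{2}{\lambda} + I(q)$ with
$I(q) = \mathrm{E}_q\left[\left(\frac{\Delta^- q(X)}{q(X)}\right)^2\right]$
the functional proposed in [15] and $\lambda$, $\sigma^2$ the mean and variance of $q$. (This
was also discussed in [3, equation 3.1] in the framework of compound Poisson
targets.) Also, in the particular case of the Poisson distribution, the function
$f_l^{p_\lambda,-}(x-1)/\lambda$ is none other than the usual solution of the standard equation
(1.11) for which we know (see [8, Theorem 2.3]) the estimate

\[ \left\| \frac{f_l^{p_\lambda,-}(\cdot - 1)}{\lambda} \right\|_\infty \le \left(1 \wedge \sqrt{\frac{2}{e\lambda}}\right) \left(\sup_{i \in \mathbb{N}} l(i) - \inf_{i \in \mathbb{N}} l(i)\right); \]

this is useful when $l$ is bounded as is the case, e.g., for the total variation
distance. Moreover, this boundedness of $f_l^{p_\lambda,-}$ also ensures that Assumption
A is satisfied whatever $q$ (with support $\mathbb{N}$) we use, hence Theorem 2.1 can be
applied. Since we always have

\[ \kappa_{\mathcal{H}}^{p,-} \le \sup_{l \in \mathcal{H}} \left\| f_l^{p,-} \right\|_\infty, \]

we conclude from Theorem 2.1 the information inequality

\[ d_{TV}(Po(\lambda), q) \le \left(1 \wedge \sqrt{\frac{2}{e\lambda}}\right) \sqrt{\sigma^2 - 2\lambda + \lambda^2 I(q)}. \tag{2.20} \]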
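
The closed-form expression of the generalized Fisher information distance for a Poisson target, namely sigma^2/lambda^2 - 2/lambda + I(q) for q supported on the non-negative integers with mean lambda, can be checked numerically. The sketch below uses a geometric density q with mean `lam = 1.5` as a hypothetical example and truncates the sums at 400 terms (an assumption; the neglected tail is negligible here).

```python
lam = 1.5
theta = lam / (1 + lam)                     # geometric on {0, 1, ...} with mean lam

def q(x):
    """Hypothetical density q(x) = (1 - theta) theta^x for x >= 0, zero otherwise."""
    return (1 - theta) * theta ** x if x >= 0 else 0.0

xs = range(0, 400)                          # truncation of the infinite sums
ratio = lambda x: q(x - 1) / q(x)           # q(x - 1) / q(x), equal to 0 at x = 0

var = sum((x - lam) ** 2 * q(x) for x in xs)                # variance of q
I_q = sum((1 - ratio(x)) ** 2 * q(x) for x in xs)           # E_q[(Delta^- q / q)^2]
J_gen = sum((ratio(x) - x / lam) ** 2 * q(x) for x in xs)   # (2.18) with Poisson target
closed_form = var / lam ** 2 - 2 / lam + I_q
```

For this particular q both `J_gen` and `closed_form` evaluate to 1 up to rounding error.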



Note how if $q = p_\lambda$ then $I(q) = \frac{1}{\lambda}$ so that $\sigma^2 - 2\lambda + \lambda^2 I(q) = 0$, as expected. The
drawback of the information distance (2.18) is, however, that it is potentially
infinite in case p and q do not share the same support; in particular in the 
Poisson case the quantity I(q) above is infinite for q with bounded support. To 
avoid this problem one needs to rescale the score function. One way to do so is 
as in the next paragraph. 

Consider the forward difference operator with $\eta = 1$. Here, of course, we
have $r^+(p,q)(x) = \frac{p(x+1)}{p(x)} - \frac{q(x+1)}{q(x)}$ and $f_l^{p,+}$ is of the form (2.9), but we opt to
re-write the product $f_l^{p,+}(x+1)\, r^+(p,q)(x)$ as

\[ f_l^{p,+}(x+1) \frac{p(x+1)}{p(x)} \left(1 - \frac{q(x+1) p(x)}{q(x) p(x+1)}\right). \]

The analog of Theorem 2.1 is obtained by yet another simple application of the
Cauchy-Schwarz inequality to this factorization.

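Theorem 2.2. Take $p, q \in \mathcal{G}$ with $S_q \subseteq S_p$ and such that $\mathcal{F}^+(p) \cap \mathcal{F}^+(q) \neq \emptyset$.
Let $d_{\mathcal{H}}(p, q)$ be defined as in (2.14) for some class of functions $\mathcal{H}$, and suppose
that for all $l \in \mathcal{H}$ the function $f_l^{p,+}$, as defined in (2.9), exists and satisfies
$f_l^{p,+} \in \mathcal{F}^+(p) \cap \mathcal{F}^+(q)$. Then

\[ d_{\mathcal{H}}(p, q) \le \kappa_{\mathcal{H}}^{p,+} \sqrt{\mathcal{K}_{gen}(p, q)}, \tag{2.21} \]

where

\[ \kappa_{\mathcal{H}}^{p,+} := \sup_{l \in \mathcal{H}} \sqrt{\mathrm{E}_q\left[\left(f_l^{p,+}(X+1) \frac{p(X+1)}{p(X)}\right)^2\right]} \]

and

\[ \mathcal{K}_{gen}(p, q) := \mathrm{E}_q\left[\left(\frac{p(X) q(X+1)}{p(X+1) q(X)} - 1\right)^2\right] \]

is the generalized scaled Fisher information between the densities $p$ and $q$.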

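In the case $p = Po(\lambda)$ we have $p_\lambda(x+1)/p_\lambda(x) = \lambda/(x+1)$ so that (2.21)
becomes

\begin{align*}
d_{\mathcal{H}}(Po(\lambda), q) &\le \sup_{l \in \mathcal{H}} \sqrt{\mathrm{E}_q\left[\left(\frac{\lambda f_l^{p_\lambda,+}(X+1)}{X+1}\right)^2\right]} \sqrt{\mathcal{K}_{gen}(Po(\lambda), q)} \\
&= \sup_{l \in \mathcal{H}} \sqrt{\lambda \mathrm{E}_q\left[\left(\frac{f_l^{p_\lambda,+}(X+1)}{X+1}\right)^2\right]} \sqrt{\mathcal{K}(Po(\lambda), q)}
\end{align*}

with $\mathcal{K}(Po(\lambda), q) = \lambda \mathcal{K}_{gen}(Po(\lambda), q)$ the scaled Fisher information distance
(2.5). Using a Poincaré inequality, [16] show that, for $q$ a discrete distribution
with mean $\lambda$,

\[ d_{TV}(Po(\lambda), q) \le \sqrt{2 \mathcal{K}(Po(\lambda), q)}. \tag{2.22} \]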

Our Theorem 2.2 allows us to improve on this result, through the inequality (see
again [8, Theorem 2.3])

\[ \left\| \frac{f_l^{p_\lambda,+}(x+1)}{x+1} \right\|_\infty \le \left(1 \wedge \sqrt{\frac{2}{e\lambda}}\right) \left(\sup_{i \in \mathbb{N}} l(i) - \inf_{i \in \mathbb{N}} l(i)\right); \]

indeed, this inequality combined with Theorem 2.2 yields (under the appropriate
and more general conditions than in [16])

\[ d_{TV}(Po(\lambda), q) \le \sqrt{\lambda} \left(1 \wedge \sqrt{\frac{2}{e\lambda}}\right) \sqrt{\mathcal{K}(Po(\lambda), q)}. \tag{2.23} \]

For $\lambda < 2/e$, we get $1 \wedge \sqrt{2/(e\lambda)} = 1$ and hence the constant in (2.23) is $\sqrt{\lambda} < \sqrt{2/e}$;
in case $\lambda \ge 2/e$, this constant equals $\sqrt{2/e}$. In both cases our constants improve
on those from (2.22). More generally one easily sees that, for instance, in all
examples considered in [16] our constants are better.
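
The comparison of constants can be made concrete with a few lines of Python. The sketch below evaluates the constant of (2.23) on a hypothetical grid of values of lambda and compares it with the constant sqrt(2) of (2.22); the function name `constant_ours` and the grid `lams` are assumptions made for the illustration.

```python
import math

def constant_ours(lam):
    """sqrt(lam) * min(1, sqrt(2 / (e * lam))), the constant in (2.23)."""
    return math.sqrt(lam) * min(1.0, math.sqrt(2.0 / (math.e * lam)))

constant_khj = math.sqrt(2.0)               # constant in (2.22)
lams = [0.1, 0.5, 2 / math.e, 1.0, 5.0, 50.0]
values = [constant_ours(lam) for lam in lams]
cap = math.sqrt(2.0 / math.e)               # value of the constant once lam >= 2/e
```

For every value on the grid the constant of (2.23) is at most `cap`, which is itself strictly smaller than `constant_khj`.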

Remark 2.1. We conclude the paper with a remark on the sub-additivity property
of the information distances (2.4) and (2.5). This property follows, in effect,
from the fact that the score function of which they are a second moment behaves
nicely over convolutions, as noted for instance by [15, Lemma 2.1] and [16, page
471]. As it turns out, both properties can also be seen to hold
in the more general framework of our Lemma 2.1. Indeed let $q(x)$ be the density
of the sum of two independent random variables $X_1$ and $X_2$, with respective
densities $q_1$ and $q_2$. Then, as already noted in [15, Section 2], we get

\[ \frac{q(x+\eta)}{q(x)} = \sum_{j=0}^{x} \frac{q_1(j) q_2(x+\eta-j)}{q(x)} = \mathrm{E}\left[\frac{q_2(X_2+\eta)}{q_2(X_2)} \,\Big|\, X_1 + X_2 = x\right]. \]

In the case where the target $p$ is a member of Ord's family of distributions
then $\Delta^\eta p(x)/p(x)$ is linear in (or inversely proportional to) $x$ so that the score
function (2.12) is easy to express under the form of a conditional expectation.
Then taking second moments leads to the conclusion.